Abstract

For memoryless sources, delayed side information at the decoder does not improve the rate-distortion function. However, this is not the case for sources with memory, as demonstrated by a number of works focusing on the special case of (delayed) feedforward. In this paper, a setting is studied in which the encoder is potentially uncertain about the delay with which measurements of the side information, which is available at the encoder, are acquired at the decoder. Assuming a hidden Markov model for the source sequences, at first, a single-letter characterization is given for the set-up where the side information delay is arbitrary and known at the encoder, and the reconstruction at the destination is required to be asymptotically lossless. Then, with delay equal to zero or one source symbol, a single-letter characterization of the rate-distortion region is given for the case where, unbeknownst to the encoder, the side information may be delayed or not, and additional information can be received by the decoder when the side information is not delayed. Finally, examples for binary and Gaussian sources are provided.

I. Introduction

Consider a sensor network in which a sensor measures a certain physical quantity $Y_i$ over time $i = 1, 2, \ldots, n$. The aim of the sensor is to communicate a symbol-by-symbol processed version $X^n = (X_1, \ldots, X_n)$ of the measured sequence $Y^n = (Y_1, \ldots, Y_n)$ to a receiver. As an example, each element $X_i$ can be obtained by quantizing or denoising $Y_i$, for $i = 1, 2, \ldots, n$. To this end, based on the observation of $X^n$ and $Y^n$, the sensor communicates a message $M$ of $nR$ bits to the receiver ($R$ is the message rate in bits per source symbol). The receiver is endowed with sensing capabilities, and hence it can measure the physical quantity as well. However, as the receiver is located further away from the physical source, such measure may come with some delay, say $n + d$ for some $d \ge 0$. Assuming that at time $n + i$ the decoder must put out an estimate $Z_i$ of the $i$th source symbol $X_i$ by design constraints, it follows that the estimate $Z_i$ can be made to be a function of the message $M$ and of the delayed side information $Y^{i-d} = (Y_1, \ldots, Y_{i-d})$ (see Fig. 1 for an illustration). Following related literature (e.g., [2]), we will refer to $d$ as the delay for simplicity. Delay $d$ may or may not be known at the sensor.

The situation described above can be illustrated schematically as in Fig. 1 for the case in which the delay $d$ is known at the encoder. In Fig. 1, the encoder ("Enc") represents the sensor and the decoder ("Dec") the receiver. The decoder at time $i$ (more precisely, $n + i$) has access to delayed side information $Y^{i-d}$ with delay $d$. Fig. 2 accounts for a setting where the side information at the decoder, unbeknownst to the encoder, may be delayed by $d$ or not delayed, where the first case is modelled by Decoder 1 and the second by Decoder 2. Note that, in the latter case, the receiver has available the sequence $Y^i = (Y_1, \ldots, Y_i)$ at time $i$. For generality, in the setting in Fig. 2, we further assume that the encoder is allowed to send additional information in the form of a message $M_\Delta$ of $n\Delta R$ bits when the side information is not delayed. This can be justified in the sensor example mentioned above, as non-delayed side information may entail that the receiver is closer to the transmitter and is thus able to decode an additional message of rate $\Delta R$ (bits/source symbol).

A. Preliminary Considerations and Related Work 

To start, let us first assume that sequences $X^n$ and $Y^n$ are memoryless sources so that the entries $(X_i, Y_i)$ are arbitrarily correlated for a given index $i$ but independent and identically distributed (i.i.d.) for different $i = 1, \ldots, n$. To streamline the discussion, the following lemma summarizes the optimal trade-off between rate $R$ and distortion $D$, as measured by a distortion metric $d(x, z)$, for the point-to-point setting of Fig. 1 with memoryless sources. Similar conclusions apply for the more general set-up of Fig. 2.

Figure 1. Source coding with delayed side information at the decoder. The side information is fully available at the encoder.

Lemma 1. /li]/, /E]/, For memoryless sources and zero delay, i.e., $d = 0$, the rate-distortion function for the point-to-point system in Fig. 1 is given by the conditional rate-distortion function

$$R(D) = \min_{p(z|x,y):\ \mathrm{E}[d(X,Z)] \le D} I(X; Z|Y). \tag{1}$$

This result remains unchanged even if the decoder has access to non-causal side information, i.e., if the reconstruction $Z_i$ can be based on the entire sequence $Y^n$, rather than only $Y^i$. Instead, for strictly positive delay $d > 0$, the rate-distortion function is the same as if there was no side information, namely $R(D) = \min_{p(z|x):\ \mathrm{E}[d(X,Z)] \le D} I(X; Z)$.

Similar conclusions can be easily shown to apply also for the more general model of Fig. 2, as will be discussed in the paper (see Sec. IV). Specifically, if $d > 0$ and the sources are memoryless, the rate-distortion function for the system of Fig. 2 with $\Delta R = 0$ reduces to the one obtained by Kaspi in [6] for a model in which decoder 1 has no side information, and, for general $\Delta R \ge 0$, the rate-distortion region coincides with the one obtained in [7] for a model with no side information at decoder 1.

We have seen in Lemma 1 that, for memoryless sources, no advantages can be accrued by leveraging a (strictly) delayed side information, i.e., with $d > 0$. However, this conclusion does not generally hold if the sources have memory. In this context, a number of works have focused on the scenario of Fig. 1 where $X_i = Y_i$ for $i = 1, \ldots, n$. This entails that the decoder observes sequence $X^n$ itself, but with a delay of $d$ symbols. This setting is typically referred to as source coding with feedforward, and was introduced in [8]. The rate-distortion function for this problem (i.e., Fig. 1 with $X_i = Y_i$) has been derived for ergodic and stationary sources in terms of multi-letter mutual informations. The result was also extended to arbitrary sources using information-spectrum methods. Achievability was obtained via the use of a codebook of codetrees. The function was explicitly evaluated for some special cases in [9], [11] (see also [10]), and im proposed an algorithm for its numerical calculation.

^1 The first part of the Lemma is due to [jS], ID, while the second can be derived as in [S Observation 2].

Figure 2. Source coding where side information at the decoder may be delayed and additional information can be delivered when side information is not delayed. The side information is fully available at the encoder.

The more general case of Fig. 1 with $X_i \neq Y_i$ was studied in [2] assuming stationary and ergodic sources $X^n$ and $Y^n$. The rate-distortion function was expressed in terms of multi-letter mutual informations. No specific examples were provided for which the function is explicitly computable. We finally remark that for more complex networks than the ones studied here, strictly delayed side information may be useful even in the presence of memoryless sources. This was illustrated in [12] for a multiple description problem with feedforward.

B. Contributions 

Figure 3. A graphical illustration of the assumed hidden Markov model for the sources.

The goal of this work is to characterize the rate-distortion trade-offs for the setting in Fig. 1 and the more general set-up in Fig. 2 for a specific class of sources $X^n$ and $Y^n$. Specifically, we assume that $Y^n$ is a Markov chain, and $X^n$ is such that $X_i$ is obtained by passing $Y_i$ through a channel $q(x|y)$ for $i = 1, \ldots, n$, as illustrated in Fig. 3. The process $X^n$ is thus a hidden Markov model. This model complies with the type of sensor network scenarios described above, where $Y^n$ is the physical quantity of interest, modelled as a Markov chain, and $X^n$ is a symbol-by-symbol processed version of $Y^n$. The main contributions and the paper organization are as follows. After the description of the system model in Sec. II for the source statistics described above,

• we derive a single-letter characterization of the minimal rate (bits/source symbol) required for asymptotically lossless compression in the point-to-point model of Fig. 1 for any delay $d \ge 0$ (Sec. III-A). Achievability is based on a novel scheme that consists of simple multiplexing/demultiplexing operations along with standard entropy coding techniques;

• we derive a single-letter characterization of the minimal rate (bits/source symbol) required for lossy compression for the point-to-point model of Fig. 1 and, more generally, for the model of Fig. 2 in which the side information may be delayed, for delays $d = 0$ and $d = 1$ (Sec. IV);

• we solve a number of specific examples, namely binary-alphabet sources with Hamming distortion and Gaussian sources with minimum mean square error distortion, and present related numerical results (Sec. V).



II. System Model

We present the system model for the scenario of Fig. 2. As detailed below, the scenario of Fig. 1 is obtained as a special case. The system is characterized by a delay $d \ge 0$; finite alphabets $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{Z}_1$, $\mathcal{Z}_2$; conditional probabilities $w_1(a|b)$, with $a, b \in \mathcal{Y}$, and $q(x|y)$, with $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ (i.e., we have $\sum_{a \in \mathcal{Y}} w_1(a|b) = 1$ and $\sum_{a \in \mathcal{X}} q(a|b) = 1$ for all $b \in \mathcal{Y}$); and distortion metrics $d_j(x, y, z_j)$: $\mathcal{X} \times \mathcal{Y} \times \mathcal{Z}_j \to [0, d_{\max}]$, such that $0 \le d_j(x, y, z_j) \le d_{\max} < \infty$ for all $(x, y, z_j) \in \mathcal{X} \times \mathcal{Y} \times \mathcal{Z}_j$ for $j = 1, 2$. As explained below, the subscript "1" in $w_1(a|b)$ indicates that $w_1(a|b)$ denotes one-step transition probabilities.

The random process $Y_i \in \mathcal{Y}$, $i \in \{\ldots, -1, 0, 1, \ldots\}$, is a stationary and ergodic Markov chain with transition probability $\Pr[Y_i = a|Y_{i-1} = b] = w_1(a|b)$. We define the probability $\Pr[Y_i = a] = \pi(a)$ and also the $k$-step transition probability $\Pr[Y_i = a|Y_{i-k} = b] = w_k(a|b)$, which are both independent of $i$ by the stationarity of $Y_i$. These quantities can be calculated using standard Markov chain theory from the transition matrix associated with $w_1(a|b)$ (see, e.g., [22]). We also set, for notational convenience, $w_0(a|b) = \pi(a)$. Sequence $Y^n = (Y_1, \ldots, Y_n)$ is thus distributed as $p(y^n) = \pi(y_1) \prod_{i=2}^{n} w_1(y_i|y_{i-1})$ for any integer $n > 0$.
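As a small illustration of how $\pi(a)$ and $w_k(a|b)$ can be obtained from the one-step transition matrix, the following Python sketch uses power iteration and repeated matrix products. The function names, the row-stochastic layout `W[b][a]` $= w_1(a|b)$, and the two-state numerical values are assumptions made here for illustration; they do not come from the paper.

```python
def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def k_step(W, k):
    """Matrix of k-step transition probabilities w_k(a|b) for k >= 1,
    with the (assumed) layout W[b][a] = w_1(a|b)."""
    P = W
    for _ in range(k - 1):
        P = mat_mul(P, W)
    return P

def stationary(W, iters=10_000, tol=1e-13):
    """Power iteration for pi satisfying pi = pi W (assumes an ergodic chain)."""
    n = len(W)
    pi = [1.0 / n] * n
    for _ in range(iters):
        new = [sum(pi[b] * W[b][a] for b in range(n)) for a in range(n)]
        if max(abs(u - v) for u, v in zip(new, pi)) < tol:
            return new
        pi = new
    return pi

# Hypothetical two-state chain used only as an example.
W = [[0.9, 0.1],
     [0.3, 0.7]]
pi = stationary(W)   # close to [0.75, 0.25]
W2 = k_step(W, 2)    # W2[0][0] = 0.9*0.9 + 0.1*0.3 = 0.84
```

For larger alphabets, a linear solver or eigen-decomposition would be the more usual choice; power iteration is used here only to keep the sketch dependency-free.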

The random process $X_i \in \mathcal{X}$, $i \in \{\ldots, -1, 0, 1, \ldots\}$, is such that vector $X^n = (X_1, \ldots, X_n) \in \mathcal{X}^n$, for any integer $n > 0$, is jointly distributed with $Y^n$ so that

$$p(x^n, y^n) = \pi(y_1) q(x_1|y_1) \prod_{i=2}^{n} p(x_i, y_i|x^{i-1}, y^{i-1}) = \pi(y_1) q(x_1|y_1) \prod_{i=2}^{n} w_1(y_i|y_{i-1}) q(x_i|y_i). \tag{2}$$

In other words, process $X_i \in \mathcal{X}$, $i \in \{\ldots, -1, 0, 1, \ldots\}$, corresponds to a hidden Markov model with underlying Markov process given by $Y_i$.
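To make the factorization of the hidden Markov model concrete, the sketch below draws a pair of sequences $(x^n, y^n)$ by first sampling the Markov chain and then passing each $y_i$ through the channel $q(x|y)$. The function name `sample_hmm`, the layouts `W[b][a]` $= w_1(a|b)$ and `Q[y][x]` $= q(x|y)$, and all numerical values are hypothetical choices made for this example.

```python
import random

def sample_hmm(n, pi, W, Q, rng):
    """Draw (x^n, y^n): y is a Markov chain started from pi with
    W[b][a] = w_1(a|b); x_i is y_i passed through Q[y][x] = q(x|y)."""
    def draw(dist):
        return rng.choices(range(len(dist)), weights=dist)[0]

    y = [draw(pi)]
    for _ in range(n - 1):
        y.append(draw(W[y[-1]]))       # next state depends only on the last one
    x = [draw(Q[yi]) for yi in y]      # memoryless channel applied symbol by symbol
    return x, y

# Hypothetical binary example (values are illustrative only).
rng = random.Random(0)
pi = [0.75, 0.25]
W = [[0.9, 0.1], [0.3, 0.7]]
Q = [[0.95, 0.05], [0.05, 0.95]]
x, y = sample_hmm(1000, pi, W, Q, rng)
```

Choosing `Q` as the identity matrix gives $X_i = Y_i$, i.e., the feedforward special case discussed in the Introduction.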

We now define encoder and decoders for the setting of Fig. 2. Specifically, a $(d, n, R, \Delta R, D_1, D_2)$ code is defined by: (i) an encoder function

$$f: (\mathcal{X}^n \times \mathcal{Y}^n) \to [1, 2^{nR}] \times [1, 2^{n\Delta R}], \tag{3}$$

which maps sequences $X^n$ and $Y^n$ into messages $M \in [1, 2^{nR}]$ and $M_\Delta \in [1, 2^{n\Delta R}]$; (ii) a sequence of decoding functions for decoder 1

$$g_{1i}: [1, 2^{nR}] \times \mathcal{Y}^{i-d} \to \mathcal{Z}_1, \tag{4}$$

for $i \in [1, n]$, which, at each time $i$, map message $M$, of rate $R$ [bits/source symbol], and the delayed side information $Y^{i-d}$ into the estimate $Z_{1i}$; (iii) a sequence of decoding functions for decoder 2

$$g_{2i}: [1, 2^{nR}] \times [1, 2^{n\Delta R}] \times \mathcal{Y}^{i} \to \mathcal{Z}_2, \tag{5}$$

for $i \in [1, n]$, which, at each time $i$, map messages $M$, of rate $R$, and $M_\Delta$, of rate $\Delta R$, and the non-delayed side information $Y^i$ into the estimate $Z_{2i}$. In (3)-(5), for $a, b$ integer with $a \le b$, we have defined $[a, b]$ as the interval $[a, a+1, \ldots, b]$ with $[a, b] = \emptyset$ if $a > b$.^2 Encoding/decoding functions (3)-(5) must satisfy the distortion constraints

$$\frac{1}{n} \sum_{i=1}^{n} \mathrm{E}[d_j(X_i, Y_i, Z_{ji})] \le D_j, \quad \text{for } j = 1, 2. \tag{6}$$



Note that these constraints are fairly general in that they allow imposing not only requirements on the lossy reconstruction of $X_i$ or $Y_i$ (obtained by setting $d_j(x, y, z_j)$ independent of $y$ or $x$, respectively), but also on some function of both $X_i$ and $Y_i$ (by setting $d_j(x, y, z_j)$ to be dependent on such function of $(x, y)$).

Given a delay $d \ge 0$, for a distortion pair $(D_1, D_2)$, we say that rate pair $(R, \Delta R)$ is achievable if, for every $\epsilon > 0$ and sufficiently large $n$, there exists a $(d, n, R, \Delta R, D_1 + \epsilon, D_2 + \epsilon)$ code. We refer to the closure of the set of all achievable rates for a given distortion pair $(D_1, D_2)$ and delay $d$ as the rate-distortion region $\mathcal{R}_d(D_1, D_2)$.

From the general description above for the setting of Fig. 2, the special case of Fig. 1 is produced by neglecting the presence of decoder 2, or equivalently by choosing $D_2 = d_{\max}$. In this case, the rate-distortion region $\mathcal{R}_d(D_1, D_2)$ is fully characterized by a function $R_d(D_1)$ as $\mathcal{R}_d(D_1, d_{\max}) = \{(R, \Delta R) : R \ge R_d(D_1), \Delta R \ge 0\}$. Function $R_d(D_1)$ hence characterizes the infimum of rates $R$ for which the pair $(D_1, d_{\max})$ is achievable, and is referred to as the rate-distortion function for the setting of Fig. 1. For the special case of the model in Fig. 2 in which $\Delta R = 0$, we define the rate-distortion function $R_d(D_1, D_2)$ in a similar way.

Notation: For $a, b$ integer with $a \le b$, we define $x_a^b = (x_a, \ldots, x_b)$; if instead $a > b$ we set $x_a^b = \emptyset$. We will also write $x^b$ for $x_1^b$ for simplicity of notation. Given a sequence $x^n = [x_1, \ldots, x_n]$ and a set $\mathcal{I} = \{i_1, \ldots, i_{|\mathcal{I}|}\} \subseteq [1, n]$, we define sequence $x^{\mathcal{I}}$ as $x^{\mathcal{I}} = [x_{i_1}, x_{i_2}, \ldots, x_{i_{|\mathcal{I}|}}]$ where $i_1 < \ldots < i_{|\mathcal{I}|}$. Random variables are denoted with capital letters and corresponding values with lowercase letters. Given random variables, or more generally vectors, $X$ and $Y$ we will use the notation $p_X(x)$ or $p(x)$ for $\Pr[X = x]$, and $p_{X|Y}(x|y)$ or $p(x|y)$ for $\Pr[X = x|Y = y]$, where the latter notations are used when the meaning is clear from the context. Given set $\mathcal{X}$, we define $\mathcal{X}^n$ as the $n$-fold Cartesian product of $\mathcal{X}$. We denote any function of $\epsilon > 0$ that tends to zero as $\epsilon \to 0$ as $\delta(\epsilon)$. When referring to $\epsilon$-typical sequences, we refer to the notion of strong typicality as treated in [14].

^2 As is standard practice, $2^{nR}$ and $2^{n\Delta R}$ are implicitly considered to be rounded up to the nearest larger integer.

III. Point-to-Point Model
In this section, we study the point-to-point model in Fig. 1.

A. Lossless Compression 

We start by characterizing the rate-distortion function $R_d(D_1)$ for any delay $d \ge 0$ under the Hamming distortion metric for $D_1 = 0$. The Hamming distortion metric is defined as $d_1(x, y, z_1) = 1(x \neq z_1)$, where $1(a) = 1$ if $a$ is true and $1(a) = 0$ otherwise. This implies that the distortion constraint (6) for $j = 1$ becomes

$$\frac{1}{n} \sum_{i=1}^{n} \mathrm{E}[1(X_i \neq Z_{1i})] = \frac{1}{n} \sum_{i=1}^{n} \Pr[X_i \neq Z_{1i}] = 0. \tag{7}$$

In other words, from the definition of achievability given above, we impose that the sequence $X^n$ be recovered with vanishingly small average symbol error probability as $n \to \infty$. We refer to this scenario as asymptotically lossless, or lossless for short.
We have the following characterization of $R_d(0)$.

Proposition 1. For any delay $d \ge 0$, the rate-distortion function for the set-up in Fig. 1 under Hamming distortion at $D_1 = 0$ is given by

$$R_d(0) = H(X_{d+1}|X_2^d, Y_1), \tag{8}$$

where the conditional entropy is calculated with respect to the distribution

$$p(y_1, x_1) = \pi(y_1) q(x_1|y_1) \quad \text{for } d = 0, \tag{9}$$

$$\text{and} \quad p(y_1, x_2, \ldots, x_{d+1}) = \pi(y_1) \sum_{y_i \in \mathcal{Y},\ i \in [2, d+1]}\ \prod_{i=2}^{d+1} w_1(y_i|y_{i-1}) q(x_i|y_i) \quad \text{for } d \ge 1. \tag{10}$$
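As a rough numerical illustration of (8)-(10), the following Python sketch evaluates $H(X_{d+1}|X_2^d, Y_1)$ by brute-force enumeration for small alphabets. The function name `lossless_rate`, the layouts `W[b][a]` $= w_1(a|b)$ and `Q[y][x]` $= q(x|y)$, and the numerical values are assumptions made for this example only; the enumeration is exponential in $d$ and is meant for small delays.

```python
from itertools import product
from math import log2

def lossless_rate(d, pi, W, Q):
    """H(X_{d+1} | X_2^d, Y_1) in bits, computed from (9) for d = 0
    and from (10) for d >= 1 by direct enumeration."""
    ny, nx = len(pi), len(Q[0])
    if d == 0:
        # (9): p(y1, x1) = pi(y1) q(x1|y1), so the rate is H(X_1 | Y_1).
        return -sum(pi[y] * Q[y][x] * log2(Q[y][x])
                    for y in range(ny) for x in range(nx) if Q[y][x] > 0)
    joint = {}  # (y1, x2, ..., x_{d+1}) -> probability, as in (10)
    for y1 in range(ny):
        for ys in product(range(ny), repeat=d):          # y_2, ..., y_{d+1}
            p_y, prev = pi[y1], y1
            for y in ys:
                p_y *= W[prev][y]
                prev = y
            if p_y == 0:
                continue
            for xs in product(range(nx), repeat=d):      # x_2, ..., x_{d+1}
                p = p_y
                for y, x in zip(ys, xs):
                    p *= Q[y][x]
                key = (y1,) + xs
                joint[key] = joint.get(key, 0.0) + p
    cond = {}   # marginal of the conditioning variables (y1, x2, ..., x_d)
    for key, p in joint.items():
        cond[key[:-1]] = cond.get(key[:-1], 0.0) + p
    return -sum(p * log2(p / cond[key[:-1]])
                for key, p in joint.items() if p > 0)

# Hypothetical binary example: pi is the stationary distribution of W.
pi = [0.75, 0.25]
W = [[0.9, 0.1], [0.3, 0.7]]
Q = [[0.95, 0.05], [0.05, 0.95]]
rates = [lossless_rate(d, pi, W, Q) for d in range(4)]
```

With `Q` set to the identity matrix (so that $X_i = Y_i$), the value returned for every $d \ge 1$ should coincide with $H(X_2|X_1)$, in line with the feedforward discussion in Remark 3.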

The proof of the converse of the proposition above is based on an appropriate use of the Fano inequality and is reported in Appendix A. To prove the direct part of the proposition, we propose a simple achievable scheme, which, to the best of the authors' knowledge, has not appeared before, in Sec. III-B.

Remark 1. Expression (8) consists of a conditional entropy of $d+1$ random variables, namely $Y_1, X_2, \ldots, X_{d+1}$. These variables are distributed as the corresponding entries in the random vectors $X^n$ and $Y^n$, as per (9)-(10) (cf. (2)). We have therefore used the same notation for the involved random variables as in Sec. II. Proposition 1 provides a "single-letter" characterization of $R_d(0)$ for the setting of Fig. 1 since it only involves a finite number of variables.^3 This contrasts with the general characterization for stationary ergodic processes of $R_d(D)$ given in [0, which is a "multi-letter" expression, whose computation can generally only be attempted numerically using approaches such as the ones proposed in BU. Note that a multi-letter expression is also given in [11] to characterize $R_d(D)$ for i.i.d. sources with negative delays $d < 0$. Finally, it should be emphasized that the simple characterization (8) for the scenario of interest here hinges on the assumed statistics of the sources $(X^n, Y^n)$.

Remark 2. By setting $d = 0$ in (8) we obtain $R_0(0) = H(X_1|Y_1)$. This result generalizes [11, Remark 3, p. 5227] from i.i.d. sources $(X^n, Y^n)$ to the hidden Markov model (2) considered here. Note that, for $d = 1$, we instead obtain $R_1(0) = H(X_2|Y_1)$. As another notable special case, if side information is absent, or equivalently $d \to \infty$, in accordance with well-known results, we obtain that $R_\infty(0)$ equals the entropy rate (see, e.g., [13])

$$H(\mathcal{X}) = \lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n). \tag{11}$$

In fact, we have

$$R_\infty(0) = \lim_{d \to \infty} H(X_{d+1}|X_2^d, Y_1) = H(\mathcal{X}) \tag{12}$$

by [13, Theorem 4.5.1].

Remark 3. Is delayed side information useful (when known also at the encoder)? That this is generally the case follows from the inequality

$$R_d(0) = H(X_{d+1}|X_2^d, Y_1) \le R_\infty(0) = H(\mathcal{X}), \tag{13}$$

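since $R_\infty(0)$ is the required rate without side information. This result is proved by the chain $H(X_{d+1}|X_2^d, Y_1) = H(X_{d+1}|X_{-\infty}^{d}, Y_1) \le H(X_{d+1}|X_{-\infty}^{d}) = H(\mathcal{X})$, where the first equality holds because, under the hidden Markov model (2), the samples $X_2^{d+1}$ are independent of the past samples $X_{-\infty}^{1}$ given $Y_1$, the inequality holds because conditioning reduces entropy, and the last equality follows from the stationarity of $X_i$. However,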
inequality (13) may not be strict, and thus side information may not be useful. A first example is the case where $X_i$ is an i.i.d. process, which is obtained by making $q(x|y)$ independent of $y$. As another example, consider the setting of source coding with feedforward [8], Q, i.e., $X_i = Y_i$. In this case, our assumption (2) entails that $X^n$ is a Markov chain, and we have $R_d(0) = H(X_{d+1}|X_1^d) = H(X_2|X_1) = H(\mathcal{X})$ for $d \ge 1$. Therefore, delayed feedforward (with $d \ge 1$) is not useful for the lossless compression of Markov chains, as already shown in [8]. This conclusion need not hold for lossy compression (i.e., for $D_1 > 0$) [8] (see also Sec. IV-A).

^3 It might be more accurately referred to as a "finite-letter" characterization.

Remark 4. If $X^n$, $Y^n$ are general jointly stationary and ergodic processes (and not necessarily stationary ergodic hidden Markov models), one can adapt in a straightforward way the proofs of Appendix A and Sec. III-B and conclude that the rate-distortion function can be written as

$$R_d(0) = \lim_{n \to \infty} \frac{1}{n} H(X^n \| Y^{n-d}), \tag{14}$$

where $H(X^n \| Y^{n-d})$ is the causally conditioned entropy $H(X^n \| Y^{n-d}) = \sum_{i=1}^{n} H(X_i|X^{i-1}, Y^{i-d})$ (see, e.g., [24]).^4 Comparing (14) with the rate $R_\infty(0) = H(\mathcal{X})$ necessary in the absence of any side information, we conclude that the reduction in the compression rate obtained by leveraging delayed side information at the decoder, when side information is known at the encoder, is given for stationary and ergodic processes by

$$R_\infty(0) - R_d(0) = \lim_{n \to \infty} \frac{1}{n} I(Y^{n-d} \to X^n). \tag{15}$$

In (15), we have used the definition of directed mutual information $I(Y^{n-d} \to X^n) = H(X^n) - H(X^n \| Y^{n-d})$ (see, e.g., [24]). Note that the rate gain (15) complements the results given in [24] on the interpretation of the directed mutual information (see also next remark).

Remark 5. Consider a variable-length (strictly) lossless source code that operates symbol by symbol such that, for every symbol $i \in [1, n]$, it outputs a string of bits $M_i(X^i, Y^{i-d})$, which is a function of $X^i$ and $Y^{i-d}$. Encoding is constrained so that the code $M_i(x^i, y^{i-d})$ for each $(x^{i-1}, y^{i-d})$ is prefix-free. The decoder, based on delayed side information, can then uniquely decode each codeword $M_i(x^i, y^{i-d})$ as soon as it is received. Following the considerations in [24, Sec. IV], it is easy to verify that rate $R_d(0)$ (and, more generally, (14)) is also the infimum of the average rate in bits/source symbol required by such code. Moreover, it is possible to construct universal context-based compression strategies by adapting the approach in [25].

^4 The limit exists because the sequence is non-increasing and bounded below.

Figure 4. A block diagram for encoder (a) and decoder (b) used in the proof of achievability of Proposition 1.

We refer to Sec. V for some examples that further illustrate some implications of Proposition 1.

B. Proof of Achievability for Proposition 1 

Proof: (Achievability) Here we propose a coding scheme that achieves rate (8). The basic idea is a non-trivial extension of the approach discussed in [11, Remark 3, p. 5227] and is described as follows. A block diagram is shown in Fig. 4 for encoder (Fig. 4-(a)) and decoder (Fig. 4-(b)).

We first describe the encoder, which is illustrated in Fig. 4-(a). To encode sequences $(x^n, y^n) \in (\mathcal{X}^n \times \mathcal{Y}^n)$, we first partition the interval $[1, n]$ into $|\mathcal{X}|^{d-1}|\mathcal{Y}|$ subintervals, which we denote as $\mathcal{I}(\tilde{x}^{d-1}, \tilde{y}) \subseteq [1, n]$, for all $\tilde{x}^{d-1} \in \mathcal{X}^{d-1}$ and $\tilde{y} \in \mathcal{Y}$. Every such subinterval $\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})$ is defined as

$$\mathcal{I}(\tilde{x}^{d-1}, \tilde{y}) = \{i : i \in [1, n] \text{ and } y_{i-d} = \tilde{y},\ x_{i-d+1}^{i-1} = \tilde{x}^{d-1}\}. \tag{16}$$

In words, the subinterval $\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})$ contains all symbol indices $i$ such that the corresponding delayed side information available at the decoder is $y_{i-d} = \tilde{y}$ and the previous $d - 1$ samples in $x^n$ are $x_{i-d+1}^{i-1} = \tilde{x}^{d-1}$. We refer to the value of the tuple $(x_{i-d+1}^{i-1}, y_{i-d})$ as the context of sample $x_i$.^5 For the out-of-range indices $i \in [-d+1, 0]$, one can assume arbitrary values for $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$, which are also shared with the decoder once and for all. Note that $\bigcup_{\tilde{x}^{d-1} \in \mathcal{X}^{d-1},\ \tilde{y} \in \mathcal{Y}} \mathcal{I}(\tilde{x}^{d-1}, \tilde{y}) = [1, n]$. Fig. 5 illustrates the definitions at hand for $d = 2$.

Figure 5. An example that illustrates the operations of the "Demux" block of the encoder used for the achievability proof of Proposition 1, as shown in Fig. 4, for $d = 2$ (symbols corresponding to out-of-range indices are set to zero).

As a result of the partition described above, the encoder "demultiplexes" sequence $x^n$ into $|\mathcal{X}|^{d-1}|\mathcal{Y}|$ sequences $x^{\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})}$, one for each possible context $(\tilde{x}^{d-1}, \tilde{y}) \in \mathcal{X}^{d-1} \times \mathcal{Y}$. This demultiplexing operation, which is controlled by the previous values of source and side information, is performed in Fig. 4-(a) by the block labelled as "Demux", and an example of its operation is shown in Fig. 5. By the ergodicity of process $X_i$ and $Y_i$, for every $\epsilon > 0$ and all sufficiently large $n$, the length of any sequence $x^{\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})}$ is guaranteed to be less than $n p_{Y_1 X_2, \ldots, X_d}(\tilde{y}, \tilde{x}^{d-1}) + \epsilon$ symbols with probability arbitrarily close to one. This is because the length $|\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})|$ of the sequence $x^{\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})}$ equals the number of occurrences of the context $(y_{i-d} = \tilde{y},\ x_{i-d+1}^{i-1} = \tilde{x}^{d-1})$, and by Birkhoff's ergodic theorem (see [13, Sec. 16.8]). In particular, for any $\epsilon > 0$ we can find an $n$ such that

$$\Pr[\mathcal{E}_1(\tilde{y}, \tilde{x}^{d-1})] \le \frac{\epsilon}{2|\mathcal{X}|^{d-1}|\mathcal{Y}|}, \tag{17}$$

where we have defined the "error" event

$$\mathcal{E}_1(\tilde{y}, \tilde{x}^{d-1}) = \{|\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})| > n p_{Y_1 X_2, \ldots, X_d}(\tilde{y}, \tilde{x}^{d-1}) + \epsilon\}. \tag{18}$$

^5 For the feedforward case $X_i = Y_i$, this definition of context is consistent with the conventional one given in [20] when specialized to Markov processes. See also Remark 5.

Each sequence $x^{\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})}$ is encoded by a separate encoder, labelled as "Enc" in Fig. 4-(a). In case the cardinality $|\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})|$ does not exceed $n p_{Y_1 X_2, \ldots, X_d}(\tilde{y}, \tilde{x}^{d-1}) + \epsilon$ (i.e., the "error" event $\mathcal{E}_1(\tilde{y}, \tilde{x}^{d-1})$ does not occur), the encoder compresses sequence $x^{\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})}$ using an entropy encoder, as explained below. If the cardinality condition is instead not satisfied (i.e., $\mathcal{E}_1(\tilde{y}, \tilde{x}^{d-1})$ is realized), then an arbitrary bit sequence of length $L(\tilde{y}, \tilde{x}^{d-1})$, to be specified below, is selected by the encoder "Enc".

The entropy encoder can be implemented in different ways, e.g., using typicality or Huffman coding (see, e.g., [13]). Here we consider a typicality-based encoder. Note that the entries $X_i$ of each sequence $X^{\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})}$ are i.i.d. with distribution $p_{X_{d+1}|Y_1 X_2, \ldots, X_d}(\cdot|\tilde{y}, \tilde{x}^{d-1})$, since conditioning on the context $\{y_{i-d} = \tilde{y},\ x_{i-d+1}^{i-1} = \tilde{x}^{d-1}\}$ makes the random variables $X_i$ independent. As is standard practice, the entropy encoder assigns a distinct label to all $\epsilon$-typical sequences $\mathcal{T}_\epsilon(p_{X_{d+1}|Y_1 X_2, \ldots, X_d}(\cdot|\tilde{y}, \tilde{x}^{d-1}))$ with respect to such distribution, and an arbitrary label to non-typical sequences. From the Asymptotic Equipartition Property (AEP), we can choose $n$ sufficiently large so that (see, e.g., [14])

$$\Pr[\mathcal{E}_2(\tilde{y}, \tilde{x}^{d-1})] \le \frac{\epsilon}{2|\mathcal{X}|^{d-1}|\mathcal{Y}|}, \tag{19}$$

where we have defined the "error" event

$$\mathcal{E}_2(\tilde{y}, \tilde{x}^{d-1}) = \{x^{\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})} \notin \mathcal{T}_\epsilon(p_{X_{d+1}|Y_1 X_2, \ldots, X_d}(\cdot|\tilde{y}, \tilde{x}^{d-1}))\}. \tag{20}$$

Moreover, by the AEP, a rate in bits per source symbol of $H(X_{d+1}|X_2^d = \tilde{x}^{d-1}, Y_1 = \tilde{y}) + \epsilon$ is sufficient for the entropy encoder to label all $\epsilon$-typical sequences.

From the discussion above, it follows that the proposed scheme encodes each sequence $x^{\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})}$ with $L(\tilde{y}, \tilde{x}^{d-1}) = n p_{Y_1 X_2, \ldots, X_d}(\tilde{y}, \tilde{x}^{d-1}) H(X_{d+1}|X_2^d = \tilde{x}^{d-1}, Y_1 = \tilde{y}) + n\delta(\epsilon)$ bits. By concatenating the descriptions of all the $|\mathcal{X}|^{d-1}|\mathcal{Y}|$ sequences $x^{\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})}$, we thus obtain that the overall rate $R$ of message $M$ for the scheme at hand is $H(X_{d+1}|X_2^d, Y_1) + \delta(\epsilon)$. The concatenation of the labels output by each entropy encoder is represented in Fig. 4-(a) by the block "Mux". We emphasize that encoder and decoder agree a priori on the order in which the descriptions of the different subsequences are concatenated. For instance, with reference to the example in Fig. 5 (with $d = 2$), message $M$ can contain first the description of the sequence corresponding to $(\tilde{x}, \tilde{y}) = (0, 0)$, then $(\tilde{x}, \tilde{y}) = (0, 1)$, etc.
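The step from the per-context lengths to the overall rate can be spelled out as follows. This is only a sketch of the bookkeeping, using the fact that a fixed multiple of $\delta(\epsilon)$ is again a function that vanishes as $\epsilon \to 0$:

```latex
\begin{align*}
R &= \frac{1}{n} \sum_{\tilde{x}^{d-1} \in \mathcal{X}^{d-1},\ \tilde{y} \in \mathcal{Y}} L(\tilde{y}, \tilde{x}^{d-1}) \\
  &= \sum_{\tilde{x}^{d-1},\ \tilde{y}} p_{Y_1 X_2, \ldots, X_d}(\tilde{y}, \tilde{x}^{d-1})\,
     H(X_{d+1} | X_2^d = \tilde{x}^{d-1}, Y_1 = \tilde{y})
     + |\mathcal{X}|^{d-1} |\mathcal{Y}|\, \delta(\epsilon) \\
  &= H(X_{d+1} | X_2^d, Y_1) + \delta(\epsilon),
\end{align*}
```

where the last line uses the definition of conditional entropy as an average over the conditioning values, and absorbs the constant factor $|\mathcal{X}|^{d-1}|\mathcal{Y}|$ into $\delta(\epsilon)$.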

We now describe the decoder, which is illustrated in Fig. 4-(b). By undoing the multiplexing operation just described, the decoder, from the message $M$, can recover the individual sequences $x^{\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})}$ through a simple demultiplexing operation for all contexts $(\tilde{x}^{d-1}, \tilde{y}) \in \mathcal{X}^{d-1} \times \mathcal{Y}$. This operation is represented by block "Demux" in Fig. 4-(b). To be precise, this demultiplexing is possible, unless the encoding "error" event

$$\mathcal{E} = \bigcup_{\tilde{x}^{d-1} \in \mathcal{X}^{d-1},\ \tilde{y} \in \mathcal{Y}} \{\mathcal{E}_1(\tilde{y}, \tilde{x}^{d-1}) \cup \mathcal{E}_2(\tilde{y}, \tilde{x}^{d-1})\} \tag{21}$$

takes place. In fact, occurrence of the "error" event $\mathcal{E}$ implies that some of the sequences $x^{\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})}$ were not correctly encoded and hence cannot be recovered at the decoder. The effect of such errors will be accounted for below.

Assume now that no error has taken place in the encoding. While the individual sequences $x^{\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})}$ can be recovered through the discussed demultiplexing operation, this does not imply that the decoder is also able to recover the original sequence $x^n$. In fact, the decoder does not know a priori the partition $\{\mathcal{I}(\tilde{x}^{d-1}, \tilde{y}) : \tilde{x}^{d-1} \in \mathcal{X}^{d-1} \text{ and } \tilde{y} \in \mathcal{Y}\}$ of the interval $[1, n]$ and thus cannot reorder the elements of sequences $x^{\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})}$ to produce $x^n$. Recall, moreover, that such re-ordering operation should be done in a causal fashion following the decoding rule (4).

We now argue that the re-ordering mentioned above is in fact possible using a decoding rule that complies with (4) via a multiplexing block controlled by the previous estimates of the source samples (block "Mux" in Fig. 4-(b)). In fact, note that at time $i$, the decoder knows $y_{i-d}$ and the previously decoded $x^{i-1}$ and can thus identify the subinterval $\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})$ to which the current symbol $x_i$ belongs. This symbol can then be immediately read as the next yet-to-be-read symbol from the corresponding sequence $x^{\mathcal{I}(\tilde{x}^{d-1}, \tilde{y})}$. Note that for the first $d$ symbols, the decoder uses the values for $x_i$ and $y_i$ at the out-of-range indices $i$ that were agreed upon with the encoder (see above). In conclusion, we remark that the scheme described above, by choosing $\epsilon$ small enough and $n$ large enough, is able to satisfy the constraint (7) to any desired accuracy. We also note that the controlled multiplexing/demultiplexing operation used in the proof is reminiscent of the scheme proposed in [26] for transmission on fading channels with side information at the transmitter and receiver.
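The controlled demultiplexing at the encoder and the causal re-multiplexing at the decoder can be sketched in a few lines of Python. This is a simplified illustration and not the scheme of the proof itself: the entropy coding stage is omitted (subsequences are passed to the decoder as-is), indices are 0-based, and the helper names `context`, `demux` and `causal_mux` as well as the padding value for out-of-range indices are assumptions made for this example.

```python
from collections import defaultdict, deque

def context(x, y, i, d, pad=0):
    """Context of sample i (0-based): (y_{i-d}, x_{i-d+1}, ..., x_{i-1}).
    Out-of-range (negative) indices take the agreed value `pad`."""
    def get(seq, j):
        return seq[j] if j >= 0 else pad
    return (get(y, i - d),) + tuple(get(x, j) for j in range(i - d + 1, i))

def demux(x, y, d):
    """Encoder side: split x into one subsequence per context, keeping order."""
    subseqs = defaultdict(list)
    for i in range(len(x)):
        subseqs[context(x, y, i, d)].append(x[i])
    return dict(subseqs)

def causal_mux(subseqs, y, d, n):
    """Decoder side: rebuild x symbol by symbol, using only y_{i-d}
    and the symbols decoded so far to select the subsequence to read from."""
    queues = {c: deque(s) for c, s in subseqs.items()}
    x_hat = []
    for i in range(n):
        c = context(x_hat, y, i, d)      # needs only x_hat[:i] and y[i-d]
        x_hat.append(queues[c].popleft())
    return x_hat

# Hypothetical binary sequences used only to exercise the round trip.
x = [0, 1, 1, 0, 1, 0, 0, 1]
y = [0, 0, 1, 1, 0, 1, 0, 0]
parts = demux(x, y, d=2)
x_rec = causal_mux(parts, y, d=2, n=len(x))   # equals x
```

The round trip works because the context of sample $i$ depends only on quantities the decoder already holds at time $i$, so both sides visit the subsequences in the same order.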

We finally need to study the effect of errors. Given the choices made above, we have that the probability of an encoding error is

$$\Pr[\mathcal{E}] \le \sum_{\tilde{x}^{d-1} \in \mathcal{X}^{d-1},\ \tilde{y} \in \mathcal{Y}} \Pr[\mathcal{E}_1(\tilde{y}, \tilde{x}^{d-1})] + \Pr[\mathcal{E}_2(\tilde{y}, \tilde{x}^{d-1})] \le \epsilon, \tag{22}$$

where the first inequality follows from the union bound and the second from (17) and (19). This implies that the distortion in (7) is upper bounded by $\epsilon$ as desired. In fact, from the definition of encoder and decoder given above, we can conclude that $\Pr[X^n \neq Z_1^n] = \Pr[\mathcal{E}] \le \epsilon$, where we recall that $Z_1^n$ is the sequence reconstructed at the decoder. Moreover, the following inequality holds in general

$$\Pr[X^n \neq Z_1^n] \ge \frac{1}{n} \sum_{i=1}^{n} \Pr[X_i \neq Z_{1i}]. \tag{23}$$

Therefore, we have $\frac{1}{n} \sum_{i=1}^{n} \Pr[X_i \neq Z_{1i}] \le \epsilon$, which concludes the proof. ■
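For completeness, the general inequality between block and average symbol error probability used at the end of the proof can be verified in one line. The following is a sketch using the indicator notation $1(\cdot)$ introduced in Sec. III-A:

```latex
\begin{align*}
1(x^n \neq z_1^n) = \max_{i \in [1, n]} 1(x_i \neq z_{1i})
  \ \ge\ \frac{1}{n} \sum_{i=1}^{n} 1(x_i \neq z_{1i})
\quad \Longrightarrow \quad
\Pr[X^n \neq Z_1^n] \ \ge\ \frac{1}{n} \sum_{i=1}^{n} \Pr[X_i \neq Z_{1i}],
\end{align*}
```

where the implication follows by taking expectations on both sides, since a maximum of $n$ terms is never smaller than their average.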

Remark 6. An alternative proof of achievability can be given by using the idea of codetrees and extending the notions of typicality introduced in yj. The proof discussed above is based on a conceptually and algorithmically simpler approach, although its applicability is limited to lossless compression (see next subsection).

Remark 7. From the inequality (23), it follows that the optimality of the scheme above can be proved also under the more stringent block error probability constraint (see also [14, Sec. 3.6.4]).

C. Lossy Compression 

Here, we obtain a characterization of the rate-distortion function $R_d(D_1)$ for $d = 0$ and $d = 1$. The proof follows as a special case of that of Proposition 4, to be discussed in the next section, and is based on similar arguments as for Proposition 1.

Proposition 2. For any delay $d \ge 0$ and distortion $D_1$, the following rate is achievable for the setting of Fig. 1

$$R_d^{(a)}(D_1) = \min I(XY; Z_1|Y_d), \tag{24}$$

with mutual informations evaluated with respect to the joint distribution

$$p(x, y, y_d, z_1) = \pi(y_d) w_d(y|y_d) q(x|y) p(z_1|x, y, y_d), \tag{25}$$

and where minimization is done over all conditional distributions $p(z_1|x, y, y_d)$ such that

$$\mathrm{E}[d_1(X, Y, Z_1)] \le D_1. \tag{26}$$

Moreover, rate (24)-(26) is the rate-distortion function, i.e., $R_d^{(a)}(D_1) = R_d(D_1)$, for $d = 0$ and $d = 1$.

Remark 8. The optimality of the conditional codebook strategy for lossless compression shown in Proposition 1 hinges on the following fact: conditioned on the context $(Y_{i-d}, X_{i-d+1}, \ldots, X_{i-1})$, the samples $X_i$ are independent of the past samples $X^{i-d}$ by the hidden Markov model assumption. Recall that the decoder has available the past source samples $(X_{i-d+1}, \ldots, X_{i-1})$ since its estimates are correct with high probability. Due to this independence property, and to the availability of the side information also at the encoder, the latter need not use "multi-letter" compression codes and can instead use simple "single-letter" entropy codes conditioned on the values of $(Y_{i-d}, X_{i-d+1}, \ldots, X_{i-1})$ without loss of optimality. In the lossy case considered in Proposition 2, instead, even for the point-to-point model, the independence condition discussed above does not hold for delays $d$ strictly larger than 1. In fact, at each time $i$, the decoder has available the delayed side information $Y^{i-d}$ only, conditioned on which the source samples $X_i$ are not independent of the past samples $X^{i-1}$. But, for $d = 1$, the independence condition at hand does apply and thus the optimality of "single-letter" codes can be proved as done in Proposition 2.

IV. When the Side Information May Be Delayed 

In this section, we consider the problem of lossy compression for the set-up of Fig. 2. Note that the asymptotically lossless case follows from Proposition 1 since, in order to guarantee lossless reconstruction also at the decoder with delayed side information, rate $R$ must satisfy the conditions in Proposition 1. Here, we obtain an achievable rate region $\mathcal{R}_d^{(a)}(D_1, D_2) \subseteq \mathcal{R}_d(D_1, D_2)$ for all delays $d \geq 0$ for the model in Fig. 2 and show that such region coincides with the rate-distortion region, i.e., $\mathcal{R}_d^{(a)}(D_1, D_2) = \mathcal{R}_d(D_1, D_2)$, for $d = 0$ and $d = 1$.

To streamline the discussion, we start by considering the special case where $\Delta R = 0$ and obtain a characterization of the rate-distortion function $R_d(D_1, D_2)$ for $d = 0$ and $d = 1$.

Proposition 3. For any delay $d \geq 0$ and distortion pair $(D_1, D_2)$, the following rate is achievable for the set-up of Fig. 2 with $\Delta R = 0$:

$$R_d^{(a)}(D_1, D_2) = \min I(XY; Z_1 | Y_d) + I(X; Z_2 | Y Y_d Z_1) \tag{27}$$

$$\phantom{R_d^{(a)}(D_1, D_2)} = \min I(Y; Z_1 | Y_d) + I(X; Z_1 Z_2 | Y Y_d), \tag{28}$$

with mutual informations evaluated with respect to the joint distribution

$$p(x, y, y_d, z_1, z_2) = \pi(y_d)\, w_d(y|y_d)\, q(x|y)\, p(z_1, z_2|x, y, y_d), \tag{29}$$

and where minimization is done over all conditional distributions $p(z_1, z_2|x, y, y_d)$ such that

$$\mathrm{E}[d_j(X, Y, Z_j)] \leq D_j, \quad \text{for } j = 1, 2. \tag{30}$$

Moreover, rate (27)-(30) is the rate-distortion function, i.e., $R_d^{(a)}(D_1, D_2) = R_d(D_1, D_2)$, for $d = 0$ and $d = 1$.

Remark 9. Rate (27) can be easily interpreted in terms of achievability. To this end, we remark that variable $Y_d$ plays the role of the delayed side information $Y^{i-d}$ at decoder 1. The coding scheme achieving rate (27) operates in two successive phases. In the first phase, the encoder encodes the reconstruction sequence $Z_1^n$ for decoder 1. Since decoder 1 has available delayed side information, using a strategy similar to the one discussed in Sec. III-B, this operation requires $I(XY; Z_1 | Y_d)$ bits per source sample, as further detailed in Sec. IV-A. Note that decoder 2 is able to recover $Z_1^n$ as well, since decoder 2 has available side information $Y^i$, and thus also the delayed side information $Y^{i-d}$. In the second phase, the reconstruction sequence $Z_2^n$ for decoder 2 is encoded. Given the side information available at decoder 2, this operation requires rate $I(X; Z_2 | Y Y_d Z_1)$, using again an approach similar to the one discussed in Sec. III-B. The converse proof is in Appendix B.

Remark 10. For memoryless sources $X^n$ and $Y^n$, obtained by setting the transition probability $w_1(y_i|y_{i-1})$ to be independent of $y_{i-1}$, it can be seen that the achievable rate (27)-(28) is the rate-distortion function for the scenario of Fig. 2 with $\Delta R = 0$ for all delays $d \geq 0$. This observation extends Lemma 1 to the more general set-up of Fig. 2 with $\Delta R = 0$. To see this, note that for $d \geq 1$, rate (27)-(28) is given by

$$R_d^{(a)}(D_1, D_2) = \min I(XY; Z_1) + I(X; Z_2 | Y Z_1), \tag{31}$$

with mutual informations evaluated with respect to the joint distribution

$$p(x, y, y_d, z_1, z_2) = \pi(y)\, q(x|y)\, p(z_1, z_2|x, y), \tag{32}$$

and where minimization is done over all conditional distributions $p(z_1, z_2|x, y, y_d)$ such that the distortion constraints (30) are satisfied. Rate (31) recovers the rate-distortion function derived in [6] for the case where decoder 1 has no side information. Therefore, rate (31) is achievable even without any side information at decoder 1. We then conclude that delayed side information is not useful for memoryless sources. Note also that [6] assumes non-causal availability of the side information at decoder 2. The equality of the rate derived in [6] and the one in Proposition 3 thus demonstrates that causal and non-causal side information lead to the same performance in terms of rate-distortion function.

Remark 11. While (27) is easier to interpret in terms of achievability, as done in Remark 9, the equivalent expression (28) highlights the rate loss due to the possible delay of the side information. In fact, the mutual information $I(X; Z_1 Z_2 | Y Y_d)$ accounts for the rate that would be needed to convey both $Z_1$ and $Z_2$ only to decoder 2, which has non-delayed side information. Therefore, the additional term $I(Y; Z_1 | Y_d)$ can be interpreted as the extra rate that needs to be expended to enable transmission of $Z_1$ also to decoder 1, which has delayed side information.
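The equivalence of the two expressions (27) and (28) is an instance of the chain rule for mutual information, and it can be checked numerically. The following is a minimal sketch, not part of the original derivation: the helper names (`random_joint`, `entropy`, `cond_mi`) are hypothetical, all alphabets are assumed binary, and the joint pmf is drawn at random rather than factorized as in (29), since the identity holds for any joint distribution.

```python
import itertools
import math
import random


def random_joint(alphabet_sizes, seed=0):
    """Random joint pmf over tuples (x, y, y_d, z1, z2); illustrative only."""
    rng = random.Random(seed)
    outcomes = list(itertools.product(*[range(k) for k in alphabet_sizes]))
    weights = [rng.random() for _ in outcomes]
    total = sum(weights)
    return {o: w / total for o, w in zip(outcomes, weights)}


def entropy(pmf, coords):
    """Entropy (in bits) of the marginal on the given coordinate indices."""
    marginal = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[c] for c in coords)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)


def cond_mi(pmf, a, b, c):
    """I(A; B | C) = H(AC) + H(BC) - H(ABC) - H(C)."""
    a, b, c = list(a), list(b), list(c)
    return (entropy(pmf, a + c) + entropy(pmf, b + c)
            - entropy(pmf, a + b + c) - entropy(pmf, c))


# Coordinate indices of the five variables in each outcome tuple.
X, Y, YD, Z1, Z2 = range(5)
pmf = random_joint([2, 2, 2, 2, 2], seed=1)

# Expression (27): I(XY; Z1 | Yd) + I(X; Z2 | Y Yd Z1)
rate_27 = cond_mi(pmf, [X, Y], [Z1], [YD]) + cond_mi(pmf, [X], [Z2], [Y, YD, Z1])
# Expression (28): I(Y; Z1 | Yd) + I(X; Z1 Z2 | Y Yd)
rate_28 = cond_mi(pmf, [Y], [Z1], [YD]) + cond_mi(pmf, [X], [Z1, Z2], [Y, YD])
# Extra rate attributed to the delayed side information at decoder 1.
rate_loss = cond_mi(pmf, [Y], [Z1], [YD])

print(rate_27, rate_28, rate_loss)
```

The two printed rates agree up to floating-point error for any seed, while `rate_loss` is the non-negative term that Remark 11 interprets as the cost of also serving decoder 1.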

We now consider the general model in Fig. 2.

Proposition 4. For any delay $d \geq 0$ and any distortion pair $(D_1, D_2)$, define $\mathcal{R}_d^{(a)}(D_1, D_2)$ as the union of all rate pairs $(R, \Delta R)$ that satisfy

$$R \geq I(Y; Z_1 | Y_d) + I(X; Z_1 U | Y Y_d) \tag{33}$$

$$R + \Delta R \geq I(Y; Z_1 | Y_d) + I(X; Z_1 Z_2 U | Y Y_d) \tag{34}$$

for some joint distribution

$$p(x, y, y_d, u, z_1, z_2) = \pi(y_d)\, w_d(y|y_d)\, q(x|y)\, p(z_1, z_2, u|x, y, y_d), \tag{35}$$

where the conditional distribution $p(z_1, z_2, u|x, y, y_d)$ is such that

$$\mathrm{E}[d_j(X, Y, Z_j)] \leq D_j, \quad \text{for } j = 1, 2. \tag{36}$$

We have that

$$\mathcal{R}_d^{(a)}(D_1, D_2) \subseteq \mathcal{R}_d(D_1, D_2) \tag{37}$$

for any $d \geq 0$. Moreover, (37) holds with equality, and thus $\mathcal{R}_d^{(a)}(D_1, D_2)$ is the rate-distortion region, for $d = 0$ and $d = 1$.

Remark 12. Let us interpret the rate region $\mathcal{R}_d^{(a)}(D_1, D_2)$ in terms of achievability. First, from Remark 9, we observe that (33) is the rate necessary to convey $Z_1^n$ to both decoder 1 and decoder 2, and an auxiliary codeword $U^n$ only to decoder 2. This auxiliary codeword $U^n$ carries information to decoder 2 that is then refined via message $M_\Delta$. In particular, rewriting (34) as $R + \Delta R \geq I(Y; Z_1 | Y_d) + I(X; Z_1 U | Y Y_d) + I(X; Z_2 | Y Y_d U Z_1)$, by comparison with (33), we see that the extra rate $I(X; Z_2 | Y Y_d U Z_1)$ is needed to transmit sequence $Z_2^n$ to decoder 2, thus refining the information available therein due to message $M$.*

Remark 13. The considerations in Remark 10 can also be easily extended to the scenario of Proposition 4 with $\Delta R > 0$.

A. Proof of Achievability of Proposition 3 and Proposition 4

Proof: (Achievability) We first prove achievability of rate (27) in Proposition 3. The proof extends the ideas discussed in Sec. III-B, to which we refer for details. In particular, here we do not detail the calculations of the encoding "error" events and distortion levels, as they follow in the same way as in Sec. III-B. To encode sequence $(x^n, y^n)$, the encoder partitions the interval $[1, n]$ into $|\mathcal{Y}|$ subintervals, namely $\mathcal{I}(\tilde{y})$ for each $\tilde{y} \in \mathcal{Y}$, so that (cf. (16))

$$\mathcal{I}(\tilde{y}) = \{i : i \in [1, n] \text{ and } y_{i-d} = \tilde{y}\}. \tag{38}$$

Similar to Sec. III-B, a different compression codebook is used for each such interval $\mathcal{I}(\tilde{y})$, and thus for each pair of "demultiplexed" subsequences $(x^{\mathcal{I}(\tilde{y})}, y^{\mathcal{I}(\tilde{y})})$. The compression of each pair of sequences $(x^{\mathcal{I}(\tilde{y})}, y^{\mathcal{I}(\tilde{y})})$ is based on a test channel $p(z_1|x, y, \tilde{y})$. Specifically, the corresponding codewords are generated i.i.d. according to the marginal distribution $\sum_{(x,y)} p(z_1|x, y, \tilde{y})\, w_d(y|\tilde{y})\, q(x|y)$, and compression is done based on standard joint typicality arguments. By the covering lemma [14], compression of sequences $(X^{\mathcal{I}(\tilde{y})}, Y^{\mathcal{I}(\tilde{y})})$ into the corresponding reconstruction sequence $z_1^{\mathcal{I}(\tilde{y})}$ requires rate $I(XY; Z_1 | Y_d = \tilde{y}) + \epsilon$ bits per source symbol in each interval $\mathcal{I}(\tilde{y})$, and thus an overall rate $I(XY; Z_1 | Y_d) + \epsilon$, following the same considerations as in Sec. III-B. In particular, the encoder multiplexes the compression indices corresponding to the $|\mathcal{Y}|$ intervals $\mathcal{I}(\tilde{y})$ to produce message $M$. Therefore, the latter only carries information about the individual sequences $Z_1^{\mathcal{I}(\tilde{y})}$, but not about the ordering of each entry within the overall sequence $Z_1^n$.

*Footnote to Remark 12: such rate can be encoded in both messages $M$ and $M_\Delta$, which leads to the sum-rate constraint (34).

Based on the sequence $z_1^n$ produced in the first encoding phase described above, the encoder then also performs a finer partition of the interval $[1, n]$ into $|\mathcal{Y}|^2 |\mathcal{Z}_1|$ intervals $\mathcal{I}(\tilde{y}, y, z)$, with $\tilde{y} \in \mathcal{Y}$, $y \in \mathcal{Y}$, and $z \in \mathcal{Z}_1$, so that

$$\mathcal{I}(\tilde{y}, y, z) = \{i : i \in [1, n] \text{ and } y_{i-d} = \tilde{y},\ y_i = y, \text{ and } z_{1i} = z\}. \tag{39}$$

Compression of sequence $x^{\mathcal{I}(\tilde{y}, y, z)}$ into the corresponding reconstruction $z_2^{\mathcal{I}(\tilde{y}, y, z)}$ is carried out according to test channel $p(z_2|x, y, \tilde{y}, z)$ as per the discussion above, requiring an overall rate of $I(X; Z_2 | Y Y_d Z_1) + \epsilon$. The compression indices for all sets $\mathcal{I}(\tilde{y}, y, z)$ are concatenated in message $M$ following the compression indices obtained from the sets $\mathcal{I}(\tilde{y})$.

Upon reception of message $M$, decoders 1 and 2 can both recover the sequences $Z_1^{\mathcal{I}(\tilde{y})}$ and $Z_2^{\mathcal{I}(\tilde{y}, y, z)}$ for all $\tilde{y} \in \mathcal{Y}$, $y \in \mathcal{Y}$, and $z \in \mathcal{Z}_1$ via simple demultiplexing. Moreover, following the same reasoning as in Sec. III-B, decoder 1 can reconstruct sequence $Z_1^n$ in the correct order in a causal fashion, using a decoder of the form (4), which depends on the message and the delayed side information, since the value of $Z_{1i}$ can be obtained from sequences $Z_1^{\mathcal{I}(\tilde{y})}$ by knowing the value of $Y_{i-d}$. Similarly, decoder 2 can reorder sequence $Z_2^n$ in a causal fashion using a decoder of the form (5). This concludes the proof of achievability for Proposition 3. ■
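The demultiplexing and causal reordering steps used in the proof above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, indices are 0-based, no actual compression is performed, and the treatment of the first $d$ samples (a fixed `pad` context for indices $i < d$) is an assumption that the text does not specify.

```python
from collections import defaultdict, deque


def demultiplex(symbols, side_info, d, pad=0):
    """Group symbols[i] by the delayed side-information value y_{i-d}.

    For i < d no delayed sample exists; a fixed `pad` context is assumed.
    """
    groups = defaultdict(list)
    for i, s in enumerate(symbols):
        ctx = side_info[i - d] if i >= d else pad
        groups[ctx].append(s)
    return dict(groups)


def remultiplex(groups, side_info, d, pad=0):
    """Restore the original order causally.

    At time i the decoder only needs y_{i-d}: it selects the subsequence
    for that context and emits its next unused entry.
    """
    queues = {ctx: deque(vals) for ctx, vals in groups.items()}
    out = []
    for i in range(len(side_info)):
        ctx = side_info[i - d] if i >= d else pad
        out.append(queues[ctx].popleft())
    return out


z1 = [3, 1, 4, 1, 5, 9, 2, 6]   # stand-in for the reconstruction sequence
y = [0, 1, 1, 0, 1, 0, 0, 1]    # binary side information
groups = demultiplex(z1, y, d=1)
restored = remultiplex(groups, y, d=1)
print(groups, restored)
```

The subsequences in `groups` carry no information about positions, mirroring the observation that message $M$ describes only the individual subsequences; the ordering is recovered from the delayed side information alone.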

We now turn to the proof of achievability of Proposition 4. For a fixed distribution (35), we need to prove that the rate region in Fig. 6 is achievable. To do this, it is enough, by standard time-sharing arguments, to prove that corner points A and B are achievable. Corner point B corresponds to the rate pair $R = I(Y; Z_1 | Y_d) + I(X; Z_1 Z_2 U | Y Y_d)$ and $\Delta R = 0$. Achievability of this point follows immediately from Proposition 3 by setting $Z_2 = (U, Z_2)$ in (27). Instead, corner point A corresponds to the rate pair

$$R = I(Y; Z_1 | Y_d) + I(X; Z_1 U | Y Y_d) \tag{40}$$

$$\text{and } \Delta R = I(X; Z_2 | U Y Y_d Z_1). \tag{41}$$

This rate pair can be achieved by using a strategy similar to the one discussed above. In this strategy, when encoding the message $M_\Delta$, which is received only at decoder 2, the encoder leverages the fact that the latter knows $Y_i$, $U_i$ and $Z_{1i}$, by appropriately partitioning the interval $[1, n]$ and using different test channels in each subinterval. ■

Figure 6. Achievable rate region used in the proof of Proposition 4. (Horizontal axis: $R$, with marks at $I(Y; Z_1 | Y_d) + I(X; Z_1 U | Y Y_d)$ and $I(Y; Z_1 | Y_d) + I(X; Z_1 Z_2 U | Y Y_d)$; vertical axis: $\Delta R$, with mark at $I(X; Z_2 | U Y Y_d Z_1)$.)

V. Examples 

In this section, we consider two specific examples relative to the scenario in Fig. 1. The first example consists of binary-alphabet sources, while the second applies the results derived above to (continuous-alphabet) Gaussian sources. We focus on a distortion metric of the form $d_1(x, y, z_1) = d_1(x, z_1)$ that does not depend on $y$. In other words, the decoder is interested in reconstructing $X^n$ within some distortion $D_1$. We note that, under this assumption, the rate (24) equals the simpler expression

$$R_d^{(a)}(D_1) = \min I(X; Z_1 | Y_d), \tag{42}$$

with mutual informations evaluated with respect to the joint distribution

$$p(x, y_d, z_1) = \pi(y_d) \left( \sum_{y \in \mathcal{Y}} w_d(y|y_d)\, q(x|y) \right) p(z_1|x, y_d), \tag{43}$$

where minimization is done over all distributions $p(z_1|x, y_d)$ such that $\mathrm{E}[d_1(X, Z_1)] \leq D_1$. Note that this simplification is without loss of optimality because the distortion constraint does not depend on the correlation between $Z_1$ and $Y$. Therefore, we can impose the Markov condition $Z_1 - (X, Y_d) - Y$ as in (42) without changing the distortion, while reducing the mutual information in (24).

A. Binary Hidden Markov Model 

In the first example, we assume that $Y_i$ is a binary Markov chain with symmetric transition probabilities $w_1(1|0) = w_1(0|1) = \epsilon$. Therefore, we have $\pi(1) = 1/2$ and $k$-step transition probabilities $w_k(1|0) = w_k(0|1) = \epsilon^{(k)}$, which can be obtained recursively as $\epsilon^{(1)} = \epsilon$ and $\epsilon^{(k)} = 2\epsilon^{(k-1)}(1 - \epsilon^{(k-1)})$ for $k \geq 2$. Note that this is a logistic map such that $\epsilon^{(k)} \to 1/2$ for large $k$. We also set $\epsilon^{(0)} = 0$, consistently with the convention adopted in the rest of the paper. Finally, we assume that

$$X_i = Y_i \oplus N_i, \tag{44}$$

with "$\oplus$" being the modulo-2 sum and $N_i$ being i.i.d. binary variables, independent of $Y^n$, with $p_N(1) = q$, $q \leq 1/2$. We adopt the Hamming distortion $d_1(x, z_1) = x \oplus z_1$.
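As an illustration of the $k$-step transition probabilities, the sketch below computes the crossover probability of the $k$-th power of the one-step transition matrix $\begin{bmatrix} 1-\epsilon & \epsilon \\ \epsilon & 1-\epsilon \end{bmatrix}$ by repeated multiplication with the one-step matrix. This is a sketch of the matrix-power definition of $w_k(1|0)$, not of the recursion as written above; the function name is hypothetical. For $k = 2$ it returns $2\epsilon(1-\epsilon)$, and for $k = 0$ it returns 0, matching the convention $\epsilon^{(0)} = 0$.

```python
def k_step_crossover(eps, k):
    """Off-diagonal entry of [[1-eps, eps], [eps, 1-eps]] ** k.

    Starts from the identity matrix (crossover 0) and applies the
    one-step symmetric transition k times.
    """
    p = 0.0
    for _ in range(k):
        # Flip occurs if exactly one of (previous k-1 steps, new step) flips.
        p = p * (1 - eps) + (1 - p) * eps
    return p


for k in range(6):
    print(k, k_step_crossover(0.1, k))
```

Equivalently, the matrix power gives the closed form $(1 - (1 - 2\epsilon)^k)/2$, which makes the convergence to $1/2$ for large $k$ explicit whenever $0 < \epsilon < 1$.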

We start by showing in Fig. 7 the rate $R_d(0)$ obtained from Proposition 1, corresponding to zero distortion ($D_1 = 0$), versus the delay $d$ for different values of $\epsilon$ and for $q = 0.1$. Note that the value of $\epsilon$ measures the "memory" of the process $Y^n$: for $\epsilon$ small, the process tends to keep its current value, while for $\epsilon = 1/2$, the values of $Y^n$ are i.i.d. For $d = 0$, we have $R_0(0) = H(X_1|Y_1) = H_b(q) = 0.469$, irrespective of the value of $\epsilon$, where we have defined the binary entropy function $H_b(a) = -a \log_2 a - (1 - a) \log_2(1 - a)$. Instead, for $d$ increasingly large, the rate $R_d(0)$ tends to the entropy rate $R_\infty(0) = H(\mathcal{X})$. This can be calculated numerically to arbitrary precision following [13, Sec. 4.5]. Note that a larger memory, i.e., a smaller $\epsilon$, leads to a smaller required rate $R_d(0)$ for all values of $d$.

Fig. 8 shows the rate $R_d(0)$ for $\epsilon = 0.1$ versus $q$ for different values of $d$. For reference, we also show the performance with no side information, i.e., $R_\infty(0) = H(\mathcal{X})$. For $q = 1/2$, the source $X^n$ is i.i.d. and delayed side information is useless in the sense that $R_d(0) = R_\infty(0) = H(X_1) = 1$ (Remark 3). Moreover, for $q = 0$, we have $X_i = Y_i$, so that $X_i$ is a Markov chain and the problem becomes one of lossless source coding with feedforward. From Remark 3, we know that delayed side information is useless also in this case, as $R_d(0) = R_\infty(0) = H(\mathcal{X}) = H_b(\epsilon) = 0.469$. For intermediate values of $q$, side information is generally useful, unless the delay $d$ is too large.

Footnote: the expression for the $k$-step transition probabilities $\epsilon^{(k)}$ follows from the standard relationship $\begin{bmatrix} 1 - \epsilon^{(k)} & \epsilon^{(k)} \\ \epsilon^{(k)} & 1 - \epsilon^{(k)} \end{bmatrix} = \begin{bmatrix} 1 - \epsilon & \epsilon \\ \epsilon & 1 - \epsilon \end{bmatrix}^k$, well known from Markov chain theory (see, e.g., [22]).

We now turn to the case where the distortion $D_1$ is generally non-zero. To this end, we evaluate the achievable rate (42) in Appendix C, obtaining

$$R_d^{(a)}(D_1) = H_b(\epsilon^{(d)} * q) - H_b(D_1) \tag{45}$$

for

$$0 \leq D_1 \leq \min\{\epsilon^{(d)} * q,\ 1 - \epsilon^{(d)} * q\}, \tag{46}$$

and $R_d^{(a)}(D_1) = 0$ otherwise. In (45)-(46) we have defined $p * q = p(1 - q) + (1 - p)q$. Recall that rate $R_d^{(a)}(D_1)$ has been proved to coincide with the rate-distortion function $R_d(D_1)$ only for $d = 0$ and $d = 1$ (Proposition 2).
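The closed-form rate (45)-(46) is straightforward to evaluate numerically. The following is a minimal sketch; the function names are hypothetical, and the $d$-step crossover probability $\epsilon^{(d)}$ is passed in directly as `eps_d` (with `eps_d = 0` for $d = 0$ and `eps_d = eps` for $d = 1$) rather than computed.

```python
import math


def binary_entropy(a):
    """H_b(a) in bits, with the convention 0 * log2(0) = 0."""
    if a <= 0.0 or a >= 1.0:
        return 0.0
    return -a * math.log2(a) - (1 - a) * math.log2(1 - a)


def binary_conv(p, q):
    """Binary convolution p * q = p(1 - q) + (1 - p)q."""
    return p * (1 - q) + (1 - p) * q


def binary_rate(eps_d, q, dist):
    """Achievable rate (45) for distortion `dist`; 0 outside the range (46)."""
    if dist < 0:
        raise ValueError("distortion must be non-negative")
    c = binary_conv(eps_d, q)
    if dist >= min(c, 1 - c):
        return 0.0
    return binary_entropy(c) - binary_entropy(dist)


# d = 0 (eps_d = 0), q = 0.1, lossless: reduces to H_b(q).
print(binary_rate(0.0, 0.1, 0.0))
# d = 1 (eps_d = eps = 0.1), q = 0.1, Hamming distortion 0.05.
print(binary_rate(0.1, 0.1, 0.05))
```

At zero distortion and $q = 0$, `binary_rate(eps, 0.0, 0.0)` reduces to $H_b(\epsilon)$, which is the feedforward case discussed next.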

As a final remark, we use the result derived above to discuss the advantages of delayed side information. To this end, set $q = 0$ so that $X_i = Y_i$ and the problem becomes one of source coding with feedforward. For $d = 1$, result (45)-(46) recovers the calculation in [8, Example 2] (see also [9]), which states that the rate-distortion function for the Markov source $X^n$ at hand with feedforward ($d = 1$) is

$$R_1(D_1) = H_b(\epsilon) - H_b(D_1) \tag{47}$$

for $D_1 \leq \min(\epsilon, 1 - \epsilon)$ and $R_1(D_1) = 0$ otherwise. From [19] (see also [21]), it is known that the rate-distortion function of a Markov source $X^n$ without feedforward, i.e., $R_\infty(D_1)$, is equal to (47) only for $D_1$ smaller than a critical value, but is otherwise larger. This demonstrates that feedforward, unlike in the lossless setting discussed above, can be useful in the lossy case for distortion levels $D_1$ sufficiently large, as first discussed in [8].

B. Hidden Gauss-Markov Model 

We now assume that $Y^n$ is a Gauss-Markov process with zero mean, power $\mathrm{E}[Y_i^2] = 1$ and correlation $\mathrm{E}[Y_i Y_{i+1}] = \rho$ (so that $\mathrm{E}[Y_i Y_{i+d}] = \rho^d$). Moreover, $X_i$ is related to $Y_i$ as

$$X_i = Y_i + N_i, \tag{48}$$

where samples $N_i$ are i.i.d. zero-mean Gaussian with variance $\sigma_N^2$ and independent of $Y^n$. We concentrate on the mean square error distortion metric $d_1(x, z_1) = (x - z_1)^2$. Using standard arguments, we can apply the achievable rate (42) to the setting at hand, although the result was derived for discrete alphabets (see [14, Ch. 3.8]). By doing so, as shown in Appendix D, we get that the following rate is achievable for $d \geq 0$:

$$R_d^{(a)}(D_1) = \frac{1}{2} \log_2\left( \frac{1 - \rho^{2d} + \sigma_N^2}{D_1} \right) \tag{49}$$

if $0 < D_1 \leq 1 - \rho^{2d} + \sigma_N^2$, and $R_d^{(a)}(D_1) = 0$ otherwise. As also discussed above, this rate coincides with the rate-distortion function for $d = 0$ and $d = 1$.

Figure 7. Minimum required rate $R_d(0)$ for lossless reconstruction for the set-up of Fig. 1 with binary sources versus delay $d$ ($q = 0.1$).

Figure 8. Minimum required rate $R_d(0)$ for lossless reconstruction for the set-up of Fig. 1 with binary sources versus parameter $q$ ($\epsilon = 0.1$).

Similar to the discussion in the previous section for a binary hidden Markov model, we remark that for $\sigma_N^2 = 0$, the problem becomes one of lossy source coding with feedforward of a Gauss-Markov process $X^n$. In this case, it is known that the rate-distortion function without feedforward, $R_\infty(D_1)$, equals $\frac{1}{2} \log_2\left( \frac{1 - \rho^2}{D_1} \right)$ only for distortions $D_1$ smaller than a critical value [19], and is otherwise larger. By comparison with (49), it then follows that feedforward, for sufficiently large distortion levels, can be useful in decreasing the rate-distortion function.
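For a numerical feel of the Gaussian example, the sketch below evaluates the achievable rate under the assumption that it takes the form $\frac{1}{2}\log_2\big((1 - \rho^{2d} + \sigma_N^2)/D_1\big)$, i.e., the conditional variance of $X$ given $Y_d$ divided by the distortion, which is the form consistent with the test channel (67)-(68) of Appendix D. The function name and argument names are hypothetical.

```python
import math


def gaussian_rate(rho, d, noise_var, dist):
    """Assumed achievable rate for the hidden Gauss-Markov model, in bits.

    cond_var = 1 - rho^(2d) + noise_var is the variance of X given Y_d;
    the rate is zero once the target distortion reaches cond_var.
    """
    if dist <= 0:
        raise ValueError("distortion must be positive")
    cond_var = 1 - rho ** (2 * d) + noise_var
    if dist >= cond_var:
        return 0.0
    return 0.5 * math.log2(cond_var / dist)


# Feedforward case (noise_var = 0, d = 1): 0.5 * log2((1 - rho^2) / D1).
print(gaussian_rate(0.9, 1, 0.0, 0.05))
# A larger delay makes the side information less informative.
print(gaussian_rate(0.9, 2, 0.1, 0.05))
```

With `noise_var = 0` and `d = 1`, the function returns the expression $\frac{1}{2}\log_2\big((1-\rho^2)/D_1\big)$ used in the comparison above, but for all $D_1$ below $1 - \rho^2$ rather than only below the critical value.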



VI. Concluding Remarks

The problem of compressing information sources in the presence of delayed side information finds application in a number of scenarios, including sensor networks and prediction/denoising. A general information-theoretic characterization of the trade-off between rate and distortion for this problem can generally be given in terms of multi-letter expressions, as done in previous work. Such expressions are proved by resorting to complex achievability schemes that operate in increasingly large blocks, and generally require involved numerical evaluations. In this work, we have instead focused on a specific class of sources, which evolve according to hidden Markov models, and derived single-letter characterizations of the rate-distortion trade-off. Such characterizations are established based on simple achievable schemes that rely on standard "off-the-shelf" compression techniques. Moreover, the analysis has focused not only on the conventional point-to-point setting of [2], but also on a more general set-up in which side information may or may not be delayed. The value of the derived characterization is demonstrated by elaborating on two examples, namely binary sources with Hamming distortion and Gaussian sources with minimum mean square error distortion.
Various extensions of the results presented here are possible. For instance, the optimal strategy for a cascade model with three nodes in which the intermediate node has causal side information and the end decoder has delayed side information $Y^{i-d}$ can be identified by applying the result in Proposition 3 in a manner similar to [27].
Appendix A
Proof of Converse for Proposition 1 

For $\epsilon > 0$, fix a code $(d, n, R, 0, \epsilon, d_{\max})$ as defined in Sec. II. Using the definition of the encoder, we have

$$nR \geq H(M) = H(M) - H(M | X^n Y^n) = I(M; X^n Y^n) = H(X^n Y^n) - H(X^n Y^n | M). \tag{50}$$

The first term in (50) can be written, using the chain rule for entropy, as

$$H(X^n Y^n) = \sum_{i=1}^{d} H(X_i | X^{i-1}) + \sum_{i=d+1}^{n} \left[ H(Y_{i-d} | Y^{i-d-1} X^{i-1}) + H(X_i | Y^{i-d} X^{i-1}) \right] + \sum_{i=n-d+1}^{n} H(Y_i | Y^{i-1} X^n)$$

$$= A + \sum_{i=d+1}^{n} \left[ H(Y_{i-d} | Y^{i-d-1} X^{i-1}) + H(X_i | Y_{i-d} X_{i-d+1}^{i-1}) \right], \tag{51}$$

where $A = \sum_{i=1}^{d} H(X_i | X^{i-1}) + \sum_{i=n-d+1}^{n} H(Y_i | Y^{i-1} X^n)$ is a finite constant that does not increase with $n$. Moreover, in the last line we have used the Markov chain $X_i - (Y_{i-d} X_{i-d+1}^{i-1}) - (Y^{i-d-1} X^{i-d})$, which follows from the hidden Markov model assumption. The second term in (50) can be similarly written as

$$H(X^n Y^n | M) = B + \sum_{i=d+1}^{n} \left[ H(Y_{i-d} | Y^{i-d-1} X^{i-1} M) + H(X_i | Y^{i-d} X^{i-1} M) \right]$$

$$\leq B + \sum_{i=d+1}^{n} \left[ H(Y_{i-d} | Y^{i-d-1} X^{i-1}) + H(X_i | Y^{i-d} M) \right], \tag{52}$$

where $B = \sum_{i=1}^{d} H(X_i | X^{i-1} M) + \sum_{i=n-d+1}^{n} H(Y_i | Y^{i-1} X^n M)$ is a finite constant that does not increase with $n$. The inequality in (52) follows since conditioning reduces entropy. Note also that we have the inequality $B \leq A$, again since conditioning reduces entropy.

By definition, a code $(d, n, R, 0, \epsilon, d_{\max})$ must satisfy (cf. (7))

$$\epsilon \geq \frac{1}{n} \sum_{i=1}^{n} P_{e,i} \geq \frac{1}{n} \sum_{i=d+1}^{n} P_{e,i}, \tag{53}$$

where we have defined $P_{e,i} = \Pr[X_i \neq Z_{1i}]$. It follows that

$$\sum_{i=d+1}^{n} H(X_i | Y^{i-d} M) \leq \sum_{i=d+1}^{n} H(X_i | Z_{1i}) \tag{54}$$

$$\leq \sum_{i=d+1}^{n} H_b(P_{e,i}) + P_{e,i} \log |\mathcal{X}| \tag{55}$$

$$\leq n H_b(\epsilon) + n \epsilon \log |\mathcal{X}| \tag{56}$$

$$= n \delta(\epsilon). \tag{57}$$

The first inequality (54) follows from the fact that $Z_{1i}$ is a function of $Y^{i-d}$ and $M$ by (4) and since conditioning reduces entropy; the second inequality (55) follows from Fano's inequality; and the third from (53).

Finally, from (50), (51), (52) and (57) we obtain

$$nR \geq A + \sum_{i=d+1}^{n} \left[ H(Y_{i-d} | Y^{i-d-1} X^{i-1}) + H(X_i | Y_{i-d} X_{i-d+1}^{i-1}) \right] - B - \sum_{i=d+1}^{n} H(Y_{i-d} | Y^{i-d-1} X^{i-1}) - n \delta(\epsilon)$$

$$= A - B + \sum_{i=d+1}^{n} H(X_i | Y_{i-d} X_{i-d+1}^{i-1}) - n \delta(\epsilon),$$

which concludes the proof. ■
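The role of the Fano term in (55)-(57) can be illustrated numerically: the per-symbol penalty $\delta(\epsilon) = H_b(\epsilon) + \epsilon \log |\mathcal{X}|$ vanishes as the error probability $\epsilon$ goes to zero, so it does not affect the rate bound in the limit. The sketch below uses base-2 logarithms and hypothetical function names.

```python
import math


def binary_entropy(a):
    """H_b(a) in bits, with the convention 0 * log2(0) = 0."""
    if a <= 0.0 or a >= 1.0:
        return 0.0
    return -a * math.log2(a) - (1 - a) * math.log2(1 - a)


def fano_delta(eps, alphabet_size):
    """Per-symbol Fano penalty delta(eps) = H_b(eps) + eps * log2|X|."""
    return binary_entropy(eps) + eps * math.log2(alphabet_size)


# The penalty shrinks with the error probability (binary alphabet shown).
for eps in (0.1, 0.01, 0.001):
    print(eps, fano_delta(eps, alphabet_size=2))
```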

Appendix B 

Proof of Converse for Proposition 3 and Proposition 4

We prove the converse for Proposition 4, since Proposition 3 follows as a special case. We focus on $d = 1$, since the proof for $d = 0$ can be obtained in a similar fashion. To this end, fix a code $(1, n, R, \Delta R, D_1 + \epsilon, D_2 + \epsilon)$ as defined in Sec. II. Using the definition of encoder (3) and decoder (4), we have

$$nR \geq H(M) = I(M; X^n Y^n)$$

$$= I(M; Y^n) + I(M; X^n | Y^n)$$

$$= \sum_{i=1}^{n} I(M; Y_i | Y^{i-1}) + I(M; X_i | Y^n X^{i-1})$$

$$= \sum_{i=1}^{n} H(Y_i | Y_{i-1}) - H(Y_i | Y^{i-1} M) + H(X_i | Y^n X^{i-1}) - H(X_i | Y^n X^{i-1} M)$$

$$= \sum_{i=1}^{n} H(Y_i | Y_{i-1}) - H(Y_i | Z_{1i} Y^{i-1} M) + H(X_i | Y_i) - H(X_i | Z_{1i} U_i Y_i Y_{i-1})$$

$$\geq \sum_{i=1}^{n} H(Y_i | Y_{i-1}) - H(Y_i | Z_{1i} Y_{i-1}) + H(X_i | Y_i Y_{i-1}) - H(X_i | Z_{1i} U_i Y_i Y_{i-1}) \tag{58}$$

$$= \sum_{i=1}^{n} I(Y_i; Z_{1i} | Y_{i-1}) + I(X_i; Z_{1i} U_i | Y_i Y_{i-1}), \tag{59}$$

where we have defined $U_i = [Y^{i-2}\, Y_{i+1}^{n}\, X^{i-1}\, M]$. All equalities above follow from standard properties of the entropy and mutual information, while the inequality (58) follows since conditioning reduces entropy. Following similar steps, we obtain

$$n(R + \Delta R) \geq H(M) + H(M_\Delta) \geq H(M M_\Delta) = I(M M_\Delta; X^n Y^n)$$

$$= I(M M_\Delta; Y^n) + I(M M_\Delta; X^n | Y^n)$$

$$= \sum_{i=1}^{n} H(Y_i | Y_{i-1}) - H(Y_i | Y^{i-1} M M_\Delta) + H(X_i | Y^n X^{i-1}) - H(X_i | Y^n X^{i-1} M M_\Delta)$$

$$= \sum_{i=1}^{n} H(Y_i | Y_{i-1}) - H(Y_i | Z_{1i} Y^{i-1} M M_\Delta) + H(X_i | Y_i) - H(X_i | Z_{1i} Z_{2i} U_i Y_i Y_{i-1} M_\Delta)$$

$$\geq \sum_{i=1}^{n} H(Y_i | Y_{i-1}) - H(Y_i | Z_{1i} Y_{i-1}) + H(X_i | Y_i Y_{i-1}) - H(X_i | Z_{1i} Z_{2i} U_i Y_i Y_{i-1})$$

$$= \sum_{i=1}^{n} I(Y_i; Z_{1i} | Y_{i-1}) + I(X_i; Z_{1i} Z_{2i} U_i | Y_i Y_{i-1}). \tag{60}$$

The proof is concluded by introducing a time-sharing variable $T$ uniformly distributed in $[1, n]$ and defining random variables $X = X_T$, $Y = Y_T$, $Y_d = Y_{T-1}$ (with $d = 1$), $Z_1 = Z_{1T}$ and $Z_2 = Z_{2T}$, and by leveraging the convexity of the mutual informations in (59) and (60) with respect to the distribution $p(z_{1i}, z_{2i}, u_i | x_i, y_i, y_{i-1})$. ■

Appendix C 
Proof of (45)-(46)

Here we prove that (45)-(46) equals (42) for the binary hidden Markov model of Sec. V-A. First, for $D_1 \geq \min\{\epsilon^{(d)} * q,\ 1 - \epsilon^{(d)} * q\} = \epsilon^{(d)} * q$, we can simply set $Z_1 = Y_d$ to obtain $I(X; Z_1 | Y_d) = 0$ and $\mathrm{E}[X \oplus Z_1] \leq D_1$, which, from (42) and the non-negativity of mutual information, leads to $R_d^{(a)}(D_1) = 0$. Similarly, for $D_1 \geq \min\{\epsilon^{(d)} * q,\ 1 - \epsilon^{(d)} * q\} = 1 - \epsilon^{(d)} * q$, we can set $Z_1 = 1 \oplus Y_d$ to prove that $R_d^{(a)}(D_1) = 0$. For the remaining distortion levels $D_1 < \min\{\epsilon^{(d)} * q,\ 1 - \epsilon^{(d)} * q\}$, under the constraint that $\mathrm{E}[X \oplus Z_1] \leq D_1$, we have the following inequalities

$$I(X; Z_1 | Y_d) = H(X | Y_d) - H(X | Y_d Z_1) \tag{61}$$

$$= H_b(\epsilon^{(d)} * q) - H(X \oplus Z_1 | Y_d Z_1) \tag{62}$$

$$\geq H_b(\epsilon^{(d)} * q) - H(X \oplus Z_1) \tag{63}$$

$$\geq H_b(\epsilon^{(d)} * q) - H_b(D_1), \tag{64}$$

where the third line follows since conditioning decreases entropy and the last line from the fact that $H_b(x)$ is increasing in $x$ for $x \leq 1/2$. This lower bound can be achieved in (42) by choosing the test channel $p(z_1 | x, y_d)$ so that $X$ can be written as

$$X = Y_d \oplus S \oplus Z_1, \tag{65}$$

where $S$ is binary with $p_S(1) = D_1$ and independent of $Z_1$ and $Y_d$, and $Z_1$ is also independent of $Y_d$. To obtain $p_{Z_1}(1)$, we need to impose that the joint distribution $p(x, y_d)$ is preserved by the given choice of $p(z_1 | x, y_d)$. To this end, note that the joint distribution $p(x, y_d)$ is such that we can write $X = Y_d \oplus Q$, where $Q$ is binary and independent of $Y_d$, with $p_Q(1) = \epsilon^{(d)} * q$. Therefore, preservation of $p(x, y_d)$ is guaranteed if the equality $\Pr[S \oplus Z_1 = 1] = p_{Z_1}(1) * D_1 = \epsilon^{(d)} * q$ holds. This leads to

$$p_{Z_1}(1) = \frac{\epsilon^{(d)} * q - D_1}{1 - 2 D_1}. \tag{66}$$

We remark that $0 \leq p_{Z_1}(1) \leq 1$, due to the inequality (46) on the distortion $D_1$. This concludes the proof. ■
Appendix D
Proof of (49)

Here we prove that (49) equals (42) for the hidden Gauss-Markov model of Sec. V-B. This follows by using arguments analogous to those used above for the binary hidden Markov model. The only non-trivial adaptation of the proof given above is the choice of the test channel for the case where $D_1 \leq 1 - \rho^{2d} + \sigma_N^2$. This must be selected so that $X$ can be written as

$$X = \rho^d Y_d + S + Z_1, \tag{67}$$

where $S$ is zero-mean Gaussian with $\mathrm{E}[S^2] = D_1$ and independent of $Z_1$ and $Y_d$, and $Z_1$ is also zero-mean Gaussian and independent of $Y_d$. To obtain $\mathrm{E}[Z_1^2]$, we need to impose that the joint distribution of $X$ and $Y_d$ is preserved by the given choice of the test channel. To this end, note that the joint distribution of $X$ and $Y_d$ is such that we can write $X = \rho^d Y_d + Q + N$, where $Q$ is zero-mean Gaussian and independent of $Y_d$ and $N$, with $\mathrm{E}[Q^2] = 1 - \rho^{2d}$. Therefore, preservation of the joint distribution of $X$ and $Y_d$ is guaranteed if the equality $\mathrm{E}[Z_1^2] + D_1 = 1 - \rho^{2d} + \sigma_N^2$ holds. This leads to

$$\mathrm{E}[Z_1^2] = 1 - \rho^{2d} + \sigma_N^2 - D_1. \tag{68}$$

We remark that $\mathrm{E}[Z_1^2] \geq 0$, due to the assumed inequality on the distortion $D_1$. ■

References 

[1] R. Venkataramanan and S. S. Pradhan, "Source coding with feed-forward: Rate-distortion theorems and error exponents for a general source," IEEE Trans. Inform. Theory, vol. 53, no. 6, pp. 2154-2179, Jun. 2007.

[2] R. Venkataramanan and S. S. Pradhan, "Directed information for communication problems with side-information and feedback/feed-forward," in Proc. of the 43rd Annual Allerton Conference, Monticello, IL, 2005.

[3] T. Berger, Rate Distortion Theory, Prentice-Hall, Englewood Cliffs, NJ, 1971.

[4] R. M. Gray, "Conditional rate-distortion theory," Stanford Univ., Stanford, CA, Electronics Laboratories Tech. Rep. 6502-2, Oct. 1972.

[5] N. Merhav and T. Weissman, "Coding for the feedback Gel'fand-Pinsker channel and the feedforward Wyner-Ziv source," IEEE Trans. Inform. Theory, vol. 52, no. 9, pp. 4207-4211, Sept. 2006.

[6] A. H. Kaspi, "Rate-distortion function when side-information may be present at the decoder," IEEE Trans. Inform. Theory, vol. 40, no. 6, pp. 2031-2034, Nov. 1994.

[7] A. Maor and N. Merhav, "On successive refinement with causal side information at the decoders," IEEE Trans. Inform. Theory, vol. 54, no. 1, pp. 332-343, Jan. 2008.

[8] T. Weissman and N. Merhav, "On competitive prediction and its relation to rate-distortion theory," IEEE Trans. Inform. Theory, vol. 49, no. 12, pp. 3185-3194, Dec. 2003.

[9] I. Naiss and H. Permuter, "Computable bounds for rate distortion with feed-forward for stationary and ergodic sources," arXiv:1106.0895v1.

[10] R. Venkataramanan and S. S. Pradhan, "On computing the feedback capacity of channels and the feed-forward rate-distortion function of sources," IEEE Trans. Commun., vol. 58, no. 7, pp. 1889-1896, Jul. 2010.

[11] T. Weissman and A. El Gamal, "Source coding with limited-look-ahead side information at the decoder," IEEE Trans. Inform. Theory, vol. 52, no. 12, pp. 5218-5239, Dec. 2006.

[12] S. S. Pradhan, "On the role of feedforward in Gaussian sources: Point-to-point source coding and multiple description source coding," IEEE Trans. Inform. Theory, vol. 53, no. 1, pp. 331-349, Jan. 2007.

[13] T. Cover and J. Thomas, Elements of Information Theory, Wiley-Interscience, 2006.

[14] A. El Gamal and Y.-H. Kim, Network Information Theory, Cambridge University Press, 2012.

[15] G. Kramer, "Capacity results for the discrete memoryless network," IEEE Trans. Inform. Theory, vol. 49, no. 1, pp. 4-21, Jan. 2003.

[16] R. Timo and B. N. Vellambi, "Two lossy source coding problems with causal side-information," in Proc. IEEE Int. Symposium on Inform. Theory (ISIT 2009), pp. 1040-1044, Seoul, South Korea.

[17] Y. Steinberg and N. Merhav, "On successive refinement for the Wyner-Ziv problem," IEEE Trans. Inform. Theory, vol. 50, no. 8, pp. 1636-1654, Aug. 2004.

[18] A. Maor and N. Merhav, "On successive refinement for the Kaspi/Heegard-Berger problem," IEEE Trans. Inform. Theory, vol. 56, no. 8, pp. 3930-3945, Aug. 2010.

[19] R. Gray, "Information rates of autoregressive processes," IEEE Trans. Inform. Theory, vol. 16, no. 4, pp. 412-421, Jul. 1970.

[20] J. Rissanen, "A universal data compression system," IEEE Trans. Inform. Theory, vol. 29, no. 5, pp. 656-664, Sept. 1983.

[21] D. Vasudevan, "Bounds to the rate distortion tradeoff of the binary Markov source," in Proc. Data Compression Conference (DCC '07), pp. 343-352, 27-29 Mar. 2007.

[22] R. G. Gallager, Discrete Stochastic Processes, Kluwer Academic Publishers, 1996.

[23] W. H. R. Equitz and T. M. Cover, "Successive refinement of information," IEEE Trans. Inform. Theory, vol. 37, no. 2, pp. 269-275, Mar. 1991.

[24] H. Permuter, Y.-H. Kim and T. Weissman, "Interpretations of directed information in portfolio theory, data compression, and hypothesis testing," IEEE Trans. Inform. Theory, vol. 57, no. 6, pp. 3248-3259, Jun. 2011.

[25] F. M. J. Willems, Y. M. Shtarkov, and T. J. Tjalkens, "The context-tree weighting method: basic properties," IEEE Trans. Inform. Theory, vol. 41, no. 3, pp. 653-664, May 1995.

[26] A. J. Goldsmith and P. P. Varaiya, "Capacity of fading channels with channel side information," IEEE Trans. Inform. Theory, vol. 43, pp. 1986-1992, Nov. 1997.

[27] D. Vasudevan, C. Tian, and S. Diggavi, "Lossy source coding for a cascade communication system with side-informations," in Communication, Control, and Computing, 2006 44th Annual Allerton Conference on, Sept. 2006.